(Un)detectable cluster structure in sparse networks
We study the problem of recovering a known cluster structure in a sparse network, also known
as the planted partitioning problem, by means of statistical mechanics. We find a sharp transition 
from un-recoverable to recoverable structure as a function of the separation of the clusters. For 
multivariate data, such transitions have been observed frequently, but always as a function of the 
number of data points provided, i.e. given a large enough data set, two point clouds can always be 
recognized as different clusters, as long as their separation is non-zero. In contrast, for the sparse 
networks studied here, a cluster structure remains undetectable even in an infinitely large network 
if a critical separation is not exceeded. We give analytic formulas for this critical separation as 
a function of the degree distribution of the network and calculate the shape of the recoverability- 
transition. Our findings have implications for unsupervised learning and data-mining in relational 
data bases and provide bounds on the achievable performance of graph clustering algorithms. 
In any branch of science, exploratory data analysis often
starts with clustering. Supposing the data are clustered,
a natural question is whether one can hope at all to recover
the underlying structure of the data from a finite
number of samples. This question has been studied
extensively by physicists in the following setting: Given
$\alpha N$ data points from a known probability distribution
forming clusters in an $N$-dimensional space (e.g. a mixture
of Gaussians), can we recover the parameters of the
probability distribution from the given data alone, and
can we label the data points correctly as belonging to one
of the point clouds? The answer that has been given is
generally yes. For any non-zero separation of the clusters
in a multidimensional space, one can learn the parameters
of the underlying distribution if only the number of
data points is large enough, i.e. one observes a transition
from unrecoverable to recoverable structure as a function
of $\alpha$.

In this contribution, we study the problem of recovering
a known cluster structure in networks. This problem
has received considerable attention recently under the
term "community detection" in the research
on the topology of complex networks. If the clusters are
all of equal size, this problem is known as the "planted
partition problem" in computer science, where the most
studied case is the planted bisection problem in which
all the nodes in the network are members of one of only
two clusters. Any pair of nodes from the same cluster
is connected with probability $p$, while any pair of nodes
from distinct clusters is connected with probability $r < p$.
As an example, Onsjö and Watanabe provide an algorithm
that provably recovers the planted solution with
probability $> 1 - \delta$ if $p - r = \Omega(N^{-1/2} \log(N/\delta))$.
Other authors present different algorithms with similar
bounds. Again, if the data set, i.e. the number
of nodes $N$, is large enough, the two clusters can be
recovered regardless of the strength of their separation
$(p - r)$. The above bound applies only to dense networks
in which the average number of connections per
node $\langle k \rangle$ grows linearly with the number of nodes in the
network. However, most real world networks on which
clustering is performed are sparse and have link densities
of the order of $1/N$, in which case the above bound
is meaningless [13, 14]. The mean connectivity of sparse
networks does not grow with the system size. Consider for
example the world wide web: doubling the number of
web pages will not lead to doubling the number of links
a single page lists or receives on average. We will show
that for sparse networks, a pre-defined cluster structure
remains unrecoverable even in the limit of infinite $N$, if
the probability for an intra-cluster link does not exceed
a critical value $p_{\mathrm{in}}^{c}$ which depends on the degree
distribution $p(k)$. We will calculate $p_{\mathrm{in}}^{c}$ and the shape of the
transition analytically as a function of $p(k)$.

Specifically, we consider the problem of recovering the
pre-defined cluster structure in infinitely large sparse networks
for which a degree distribution $p(k)$ is given and is
the same in all clusters. The average connectivity per
node $\langle k \rangle$ is assumed to be finite. We parameterize the
planted cluster structure of the network by the number
of clusters $q$ and the probability $p_{\mathrm{in}}$ that a given edge lies
within one of $q$ equal-sized clusters. Every node $i$ carries
an index $s_i \in \{1, \ldots, q\}$ indicating the cluster to which it
belongs by design. For $p_{\mathrm{in}} = 1/q$ the pre-defined cluster
structure cannot be recovered by definition, while for
$p_{\mathrm{in}} = 1$ recovery is trivial as our network consists of $q$
disconnected parts. Given such a network, we are interested
in finding a partition, i.e. an assignment of indices
$\sigma_i \in \{1, \ldots, q\}$ to the $N$ nodes of the network, such as
to maximize the accuracy $A = \sum_i \delta_{s_i, \sigma_i} / N$ of recovering
the planted solution. Since we cannot assume knowledge
of $p_{\mathrm{in}}$, the best possible approach is to find a partition
that minimizes the number of edges running between different
parts, i.e. a minimum cut partition. Naively, one
would expect the overlap of the minimum cut partition
with the planted solution, and hence the accuracy, to increase
steadily with $p_{\mathrm{in}}$ between $1/q$ and 1. However,
we will show that for sparse networks, the minimum cut
partition is uncorrelated with the planted partition until
$p_{\mathrm{in}}$ exceeds some critical value $p_{\mathrm{in}}^{c}$ which depends on
$p(k)$. Hence, below $p_{\mathrm{in}}^{c}$ the planted solution is impossible
to recover. We will calculate $p_{\mathrm{in}}^{c}$ and the maximum
achievable accuracy as a function of $p_{\mathrm{in}}$ and $p(k)$.
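
As a minimal illustration of the two quantities just defined, the following Python sketch computes the accuracy $A$ and the cut size of a candidate partition on a toy graph. The function names, the toy edge list and the maximization over label permutations (the labels of a partition are arbitrary) are assumptions made for this illustration only; they are not part of the analysis in the text.

```python
from itertools import permutations


def accuracy(planted, assigned):
    """A = (1/N) * sum_i delta(s_i, sigma_i): fraction of matching labels."""
    matches = sum(1 for s, a in zip(planted, assigned) if s == a)
    return matches / len(planted)


def best_accuracy(planted, assigned, q):
    """Accuracy maximized over all relabelings of the q parts.

    Assumption of this sketch: part labels carry no meaning, so we compare
    up to a permutation (only feasible for small q).
    """
    return max(
        accuracy(planted, [perm[a] for a in assigned])
        for perm in permutations(range(q))
    )


def cut_size(edges, assigned):
    """Number of edges whose end points lie in different parts."""
    return sum(1 for i, j in edges if assigned[i] != assigned[j])


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
planted = [0, 0, 0, 1, 1, 1]
candidate = [1, 1, 1, 0, 0, 1]  # labels swapped, node 5 misplaced

print(best_accuracy(planted, candidate, q=2))
print(cut_size(edges, planted), cut_size(edges, candidate))
```

On this toy graph the planted partition cuts a single edge, while the candidate cuts three and recovers five of the six planted labels.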

Let us formulate the problem of finding a minimum 
cut partition as finding the ground state of the following 
ferromagnetic Potts Hamiltonian: 



$$\mathcal{H}_{\mathrm{Part}} = -\sum_{i<j} J_{ij}\, \delta_{\sigma_i \sigma_j} + \mathrm{Constraint}. \qquad (1)$$



Here, $J_{ij}$ is the $\{0, 1\}$ adjacency matrix of the graph and
the constraint enforces a zero-magnetization ground state
corresponding to an equi-partition. For graphs without
cluster structure, i.e. $p_{\mathrm{in}} = 1/q$, this problem has been
studied extensively for Poissonian degree distributions or
Bethe lattices with a fixed valence [17, 18]. A
recent result generalizes to arbitrary degree distributions.
Note that the energy of the planted partition is
$E_p = p_{\mathrm{in}} \langle k \rangle / 2$, where $\langle k \rangle$ is the average degree.
To study the ground state of (1) we employ the Bethe-Peierls
approach from statistical mechanics, also known as the cavity
method or belief propagation, directly at zero temperature [21].
At an informal level, for the ferromagnetic system studied here,
this method can be described as the following process:
nodes are assumed to pass messages $u$ among each other
across the links of the network. A message from node
$i$ to $j$ is a $q$-dimensional vector of zeros and ones. An
entry of one in component $s$ of $u$ indicates to node $j$ that
node $i$ would like node $j$ to assume state $\sigma_j = s$. To
generate this message to $j$, node $i$ takes the messages
coming from all other nodes $k \neq j$ connected to $i$ and
sums them to obtain a so-called cavity field
$h_{i \to j} = \sum_{k \neq j} J_{ki}\, u_{k \to i}$. Then, node $i$ converts this cavity field
into a message to $j$ via $u_{i \to j} = u(h_{i \to j})$. In our case, the
function $u$ is defined via



$$v(h) = \max(h_1, \ldots, h_q), \qquad (2)$$
$$u_s(h) = \max(h_1, \ldots, h_s + 1, \ldots, h_q) - v(h). \qquad (3)$$



This means that $u$ picks the maximum components in $h$
and sets all corresponding components in $u$ to one and
the rest to zero. Due to possible degeneracy in the components
of $h$, the vector $u = u(h)$ may have more than
one non-zero entry and is never completely zero. This
observation is fundamental for all further developments.
The field components of $h$ take only integer values, because
we only have integer couplings $J_{ij}$ between the
spins. This process of message passing is iterated until a
stationary state is reached corresponding to the replica
symmetric ground state. It is fully described by the probability
distribution $Q^s(u)$ of messages being sent in the
system. The superscript $s$ denotes a possible dependence
of this distribution on the index of the pre-defined cluster
to which the sending node belongs. An easy-to-follow
formal derivation of a set of self-consistent integral equations
for $Q^s(u)$ can be found in Refs. [22, 23]. It is general
in that the particular form of the Hamiltonian enters
only via the functions $v(h)$ and $u(h)$ and is therefore not
repeated here.
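
The message rule described above (mark every maximal component of the cavity field with a one, all others with a zero) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the convention that all existing links carry coupling $J = 1$ are assumptions of the example.

```python
def v(h):
    """Largest component of the field h."""
    return max(h)


def u(h):
    """Message generated from field h.

    Component s equals max(h_1, ..., h_s + 1, ..., h_q) - v(h), which is 1
    when h_s is among the maximal components and 0 otherwise, because the
    field components are integers.
    """
    h = list(h)
    top = v(h)
    return [max(h[:s] + [h[s] + 1] + h[s + 1:]) - top for s in range(len(h))]


def cavity_field(incoming, q):
    """Component-wise sum of incoming messages (J = 1 on every link)."""
    return [sum(msg[s] for msg in incoming) for s in range(q)]


# A unique maximum gives a single non-zero entry ...
print(u([2, 1, 0]))
# ... while degenerate maxima give several non-zero entries.
print(u(cavity_field([[1, 0], [0, 1], [1, 1]], q=2)))
# A node without other neighbors has a zero field and sends all ones.
print(u(cavity_field([], q=2)))
```

The last two calls illustrate why a message may have more than one non-zero entry but is never completely zero.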

There are $2^q - 1$ possible messages $u$. Since the probabilities
of sending them may depend on the planted
cluster from which they are sent, there are in principle
$q(2^q - 1)$ different probabilities $Q^s(u)$ to determine. We
are only interested in distributions that allow the constraint
of an equi-partition to be fulfilled and that are symmetric
under permutation of the indices, as is our planted
cluster structure. These conditions reduce the number of
different probabilities $Q^s(u)$ to only $2q - 1$ parameters



$$Q^s(u) = \eta_{cw}, \quad \text{where } c = u_s \text{ and } w = \sum_{t \neq s} u_t. \qquad (4)$$



Here, $u_s$ denotes the $s$-th component of the message vector
$u$ under consideration. Without loss of generality, we
have thus implicitly introduced a preferred direction for
each planted cluster. The probability $Q^s(u)$ that a node
from planted cluster $s$ sends a message $u$ depends only
on whether or not $u$ has an entry of one in the "correct"
component $s$ ($c = 1$) and on how many "wrong" components
$w$ in $u$ carry an entry of one ($w \in \{1 - c, \ldots, q - 1\}$). It
is understood that for $p_{\mathrm{in}} \to 1$ we have $\eta_{10} \to 1$, i.e. only
"correct" messages are sent. Equivalently, for $p_{\mathrm{in}} \to 1/q$
we must have $\eta_{1,r-1} = \eta_{0,r} \equiv \eta_r$, i.e. the probability of
a message depends only on the number $r = w + c$ of non-zero
entries in it. The $2q - 1$ new order parameters $\eta_{cw}$
which describe $Q^s(u)$ obey the following normalization:



$$1 = \sum_{c=0}^{1} \sum_{w=1-c}^{q-1} \binom{q-1}{w}\, \eta_{cw}. \qquad (5)$$
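
The counting behind this reduction can be checked by brute force. The sketch below enumerates all non-zero binary messages for a given $q$, groups them by the pair $(c, w)$ as seen from one planted cluster, and compares the group sizes with the binomial weights $\binom{q-1}{w}$. The function name and the choice of the first component as the "correct" one are arbitrary choices made for this illustration.

```python
from collections import Counter
from itertools import product
from math import comb


def message_classes(q, s=0):
    """Count non-zero messages per class (c, w) for planted cluster s."""
    counts = Counter()
    for msg in product((0, 1), repeat=q):
        if not any(msg):
            continue  # the all-zero message never occurs
        c = msg[s]          # entry in the "correct" component
        w = sum(msg) - c    # number of "wrong" components set to one
        counts[(c, w)] += 1
    return counts


q = 4
classes = message_classes(q)
print(sum(classes.values()))  # number of possible messages, 2**q - 1
print(len(classes))           # number of order parameters, 2*q - 1
print(all(n == comb(q - 1, w) for (c, w), n in classes.items()))
```

For $q = 2$ the three classes are $(1,0)$, $(0,1)$ and $(1,1)$, each containing exactly one message.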



Let us now turn to the problem of a planted bisection,
i.e. the case of two clusters. Then, we only have three
possible messages $u \in \{(1,0), (0,1), (1,1)\}$ and three order
parameters $\eta_{cw}$. The self-consistent integral equation
for $Q^s(u)$ can then be written as a set of polynomial
equations for the $\eta_{cw}$ in a simple way:
Together with the normalization condition $1 = \eta_{10} + \eta_{01} + \eta_{11}$
this forms a closed set of equations in which
$q(d) = (d + 1)\, p(d + 1) / \langle k \rangle$ denotes the excess degree
distribution and we have used the abbreviations
$\tilde\eta_{10} = p_{\mathrm{in}} \eta_{10} + (1 - p_{\mathrm{in}}) \eta_{01}$ and
$\tilde\eta_{01} = p_{\mathrm{in}} \eta_{01} + (1 - p_{\mathrm{in}}) \eta_{10}$.
Equations (6, 7) are easily solved for any value of $p_{\mathrm{in}}$ and
any degree distribution $p(k)$ by iteration. We see that for
$p_{\mathrm{in}} = 1/2$ we must have $\eta_{10} = \eta_{01} \equiv \eta_1$ and only one
independent order parameter remains.

FIG. 1: Left: The order parameters $\eta_{cw}$ for the planted bisection problem on a random Bethe lattice with $k = 3$ links per
node as a function of $p_{\mathrm{in}}$. The planted cluster structure in the network does not influence the ground state configuration until
a critical value of $p_{\mathrm{in}}$ is reached. Middle: The ground state energy $E$ of (1) and the energy of the planted cluster structure
$E_p$ vs. $p_{\mathrm{in}}$. The left vertical blue line indicates the critical value $p_{\mathrm{in}}^{c}$ beyond which $\eta_{10} > \eta_{01}$ and $E < E^{\mathrm{Rnd}}$ and the
planted cluster structure starts to influence the ground state energy. The right vertical blue line indicates the naive value
$p_{\mathrm{in}}^{n} = 2 E^{\mathrm{Rnd}} / \langle k \rangle$ beyond which $E_p < E^{\mathrm{Rnd}}$. Right: The accuracy with which the planted cluster structure may be recovered.
Again, the two vertical lines indicate $p_{\mathrm{in}}^{c}$ and $p_{\mathrm{in}}^{n}$.
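
A fixed-point iteration of this kind can be sketched as follows. Note the assumptions: the update rule below is our own reading of the message passing process described in the text (a node with $d$ other neighbors sends $(1,0)$, $(0,1)$ or $(1,1)$ depending on whether the incoming "correct"-only messages outnumber, are outnumbered by, or tie with the "wrong"-only ones). It is not a transcription of Eqs. (6, 7). The function names, the starting point and the per-sweep renormalization (a numerical safeguard that enforces the normalization condition) are likewise choices of this sketch.

```python
from math import factorial


def outcome_probs(d, a, b, c):
    """Return (P[n10 > n01], P[n01 > n10], P[n10 == n01]) for d incoming
    messages drawn independently as (1,0), (0,1), (1,1) with probabilities
    a, b, c."""
    win = lose = tie = 0.0
    for n10 in range(d + 1):
        for n01 in range(d - n10 + 1):
            n11 = d - n10 - n01
            coef = factorial(d) // (
                factorial(n10) * factorial(n01) * factorial(n11)
            )
            weight = coef * a**n10 * b**n01 * c**n11
            if n10 > n01:
                win += weight
            elif n01 > n10:
                lose += weight
            else:
                tie += weight
    return win, lose, tie


def solve_order_parameters(excess, p_in, start=(0.4, 0.3, 0.3), n_iter=5000):
    """Iterate (eta10, eta01, eta11) for a planted bisection.

    excess maps d to q(d), the excess degree distribution. The start is
    slightly asymmetric so that a symmetry-broken solution can be reached.
    """
    e10, e01, e11 = start
    for _ in range(n_iter):
        a = p_in * e10 + (1 - p_in) * e01  # eta-tilde_10
        b = p_in * e01 + (1 - p_in) * e10  # eta-tilde_01
        new = [0.0, 0.0, 0.0]
        for d, q_d in excess.items():
            for i, prob in enumerate(outcome_probs(d, a, b, e11)):
                new[i] += q_d * prob
        total = sum(new)  # renormalize to suppress rounding drift
        e10, e01, e11 = (x / total for x in new)
    return e10, e01, e11


# Random Bethe lattice with three links per node: excess degree is always 2.
bethe3 = {2: 1.0}
for p_in in (0.5, 0.8, 0.95):
    print(p_in, solve_order_parameters(bethe3, p_in))
```

Under these assumptions the iteration returns $\eta_{10} = \eta_{01}$ at $p_{\mathrm{in}} = 0.5$ and $0.8$, and a symmetry-broken solution with $\eta_{10} > \eta_{01}$ at $p_{\mathrm{in}} = 0.95$, in line with the critical value $7/8$ quoted in the text for this lattice.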

We assume node $i$ is assigned state $\sigma_i$ corresponding
to the maximum component of the effective field
$h_i^{\mathrm{eff}} = \sum_j J_{ji}\, u_{j \to i}$. In case of degeneracy, $\sigma_i$ is chosen
with equal probability among the different maximum
components. Given the distribution $Q^s(u)$, we can thus
calculate the probability $p(\sigma_i = s \,|\, s)$ that a node is assigned
to its correct pre-defined cluster, from which the accuracy
follows. The ground state energy of the partitioning
problem is then given by:



$$E = \frac{\langle k \rangle}{2} \left( 1 + 2 (X - \eta_{10} \eta_{01}) - (1 - p_{\mathrm{in}}) (\eta_{10} - \eta_{01})^2 \right), \qquad (8)$$



where we have introduced X as an abbreviation for 
In case of a Poissonian degree distribution $p(k) = e^{-\lambda} \lambda^k / k!$
with mean $\lambda$, we can express this using modified
Bessel functions of the first kind $I(n, x)$:



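$$X = \sqrt{\tilde\eta_{10} \tilde\eta_{01}}\; e^{-\lambda (1 - \eta_{11})}\, I\!\left(1,\, 2 \lambda \sqrt{\tilde\eta_{10} \tilde\eta_{01}}\right). \qquad (10)$$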



Let us denote by $E^{\mathrm{Rnd}}$ the ground state energy for
$p_{\mathrm{in}} = 1/q$.
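
To make the two observables concrete, the sketch below evaluates the energy and the accuracy of a planted bisection from given order parameters $(\eta_{10}, \eta_{01}, \eta_{11})$. Several ingredients are assumptions of this sketch and should be read with care: the prefactor $\langle k \rangle / 2$ in the energy, the specific expression used for $X$ (here taken as $\tilde\eta_{01} \sum_d q(d)\, P_d[n_{10} = n_{01} + 1]$, our own reading, chosen because it reproduces the value $E^{\mathrm{Rnd}} = 25/18$ quoted in the text for the Bethe lattice with three links per node), and all function names. The accuracy follows the tie-breaking rule stated above.

```python
from math import factorial


def trinomial(n, n10, n01, a, b, c):
    """Probability of n10, n01 and n - n10 - n01 messages of each type."""
    n11 = n - n10 - n01
    coef = factorial(n) // (factorial(n10) * factorial(n01) * factorial(n11))
    return coef * a**n10 * b**n01 * c**n11


def mixed(eta, p_in):
    """Return (eta-tilde_10, eta-tilde_01, eta_11)."""
    e10, e01, e11 = eta
    return (p_in * e10 + (1 - p_in) * e01,
            p_in * e01 + (1 - p_in) * e10,
            e11)


def energy(degree, eta, p_in):
    """Energy per node; degree maps k to p(k). X is an assumed expression."""
    e10, e01, _ = eta
    a, b, c = mixed(eta, p_in)
    mean_k = sum(k * p_k for k, p_k in degree.items())
    x = 0.0
    for k, p_k in degree.items():
        if k == 0:
            continue
        d, q_d = k - 1, k * p_k / mean_k  # excess degree and its weight
        for n01 in range((d + 1) // 2):   # requires 2 * n01 + 1 <= d
            x += q_d * b * trinomial(d, n01 + 1, n01, a, b, c)
    return 0.5 * mean_k * (
        1 + 2 * (x - e10 * e01) - (1 - p_in) * (e10 - e01) ** 2
    )


def accuracy(degree, eta, p_in):
    """Probability that a node ends up in its planted cluster (ties: 1/2)."""
    a, b, c = mixed(eta, p_in)
    acc = 0.0
    for k, p_k in degree.items():
        for n10 in range(k + 1):
            for n01 in range(k - n10 + 1):
                w = p_k * trinomial(k, n10, n01, a, b, c)
                if n10 > n01:
                    acc += w
                elif n10 == n01:
                    acc += 0.5 * w
    return acc


bethe3 = {3: 1.0}
symmetric = (1 / 3, 1 / 3, 1 / 3)  # order parameters below the transition
e_rnd = energy(bethe3, symmetric, 0.5)
print(e_rnd, 2 * e_rnd / 3, accuracy(bethe3, symmetric, 0.5))
```

With the symmetric order parameters of the three-regular Bethe lattice, the sketch gives an energy of $25/18$, a naive threshold $2E^{\mathrm{Rnd}}/\langle k \rangle = 25/27$ and an accuracy of $1/2$, i.e. random guessing.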

Figure 1 shows the order parameters, the ground state
energy and the achievable accuracy of recovering the
planted bisection as a function of $p_{\mathrm{in}}$ for a random Bethe
lattice with exactly three neighbors per node. First, the
order parameters $\eta_{10}$ and $\eta_{01}$, i.e. the probabilities of
sending a message indicating the correct or wrong cluster,
respectively, are equal until a critical value $p_{\mathrm{in}}^{c}$
is reached. For more than two clusters, we also observe
this bifurcation for the order parameter pair $\eta_{1,r-1}$, $\eta_{0,r}$.
Second, the ground state energy remains on the level of
$E^{\mathrm{Rnd}}$ until $p_{\mathrm{in}} > p_{\mathrm{in}}^{c}$. Third, the ground state configuration
has only random overlap with the planted partition,
as seen from the plot of the accuracy, until $p_{\mathrm{in}} > p_{\mathrm{in}}^{c}$.
This means that as long as $p_{\mathrm{in}} < p_{\mathrm{in}}^{c}$, the planted partition
does not influence the ground state and is hence not
detectable! The value $p_{\mathrm{in}}^{c} = 7/8$ at which the planted
solution starts to influence the ground state is smaller
than the naive guess $p_{\mathrm{in}}^{n} = 2 E^{\mathrm{Rnd}} / \langle k \rangle = 25/27$, the value
for which the planted solution starts to have an energy
below $E^{\mathrm{Rnd}} = 25/18$.

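Let us now study how the critical value $p_{\mathrm{in}}^{c}$ changes
with the degree distribution. At the transition point, we
can set $\eta_{10} = \eta_{01} + \delta \approx \eta_1$. With $p_{\mathrm{out}} = 1 - p_{\mathrm{in}}$ we then have
$\tilde\eta_{10} = \eta_{10} - \delta\, p_{\mathrm{out}}$ and $\tilde\eta_{01} = \eta_{10} - \delta\, p_{\mathrm{in}}$.
Inserting these expressions in (6, 7) and expanding for small $\delta$ we arrive at: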
Here, we denote by $\eta_1$ and $\eta_2$ the order parameters that
we calculate for $p_{\mathrm{in}} = 1/2$ and that remain valid for all
$p_{\mathrm{in}} < p_{\mathrm{in}}^{c}$. Again, expression (11) is easily calculated
for any degree distribution. In case of a Poissonian degree
distribution $p(k) = e^{-\lambda} \lambda^k / k!$ with mean $\lambda$, we can
simplify (11) to



(Pin Pout ) 



771 



(12) 
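
The small-$\delta$ expansion can also be carried out numerically. The sketch below first finds the symmetric solution and then measures by how much one update step amplifies a small imbalance $\delta$ between "correct" and "wrong" messages. Because the imbalance enters the update only through $(p_{\mathrm{in}} - p_{\mathrm{out}})\,\delta$, the critical point is where $(p_{\mathrm{in}} - p_{\mathrm{out}})$ times this amplification equals one. As before, the update rule (comparing the counts of incoming messages) is our own reading of the message passing process, not a transcription of the expressions in the text, and all names are hypothetical.

```python
from math import factorial


def averaged_outcomes(excess, a, b, c):
    """Excess-degree average of (P[n10 > n01], P[n01 > n10], P[tie])."""
    out = [0.0, 0.0, 0.0]
    for d, q_d in excess.items():
        for n10 in range(d + 1):
            for n01 in range(d - n10 + 1):
                n11 = d - n10 - n01
                coef = factorial(d) // (
                    factorial(n10) * factorial(n01) * factorial(n11)
                )
                weight = q_d * coef * a**n10 * b**n01 * c**n11
                index = 0 if n10 > n01 else (1 if n01 > n10 else 2)
                out[index] += weight
    return out


def critical_p_in(excess, n_iter=2000, eps=1e-6):
    """Estimate the critical p_in of a planted bisection numerically."""
    # Symmetric solution: eta_1 = eta10 = eta01 and eta_2 = eta11.
    eta1, eta2 = 0.3, 0.4
    for _ in range(n_iter):
        win, lose, tie = averaged_outcomes(excess, eta1, eta1, eta2)
        total = win + lose + tie  # renormalize against rounding drift
        eta1, eta2 = win / total, tie / total
    # Amplification of a small imbalance at full separation p_in - p_out = 1.
    win, lose, _ = averaged_outcomes(
        excess, eta1 + eps / 2, eta1 - eps / 2, eta2
    )
    growth = (win - lose) / eps
    # Critical point: (p_in - p_out) * growth = 1 with p_out = 1 - p_in.
    return 0.5 * (1 + 1 / growth)


# Random Bethe lattice with three links per node (excess degree 2).
print(critical_p_in({2: 1.0}))
```

For the Bethe lattice with three links per node this estimate agrees with the value $p_{\mathrm{in}}^{c} = 7/8$ stated in the text.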



Figure 2 shows the dependence of $p_{\mathrm{in}}^{c}$ on the degree
distribution. As a general feature, $p_{\mathrm{in}}^{c}$ decreases with
increasing $\langle k \rangle$. However, the critical $p_{\mathrm{in}}$ for distributions
with fat tails is lower than for networks with a Poissonian
degree distribution. Note the correspondence to the
results in Ref. [20] on the cut-size of these graphs. The
critical value of $p_{\mathrm{in}}$ is smaller, i.e. clusters are easier to
detect, for networks with degree distributions which are
harder to cut. Ref. [20] suggests a universal dependence
of $E^{\mathrm{Rnd}}$ on $\langle \sqrt{k} \rangle$ based on a replica calculation. Our
calculations here support this result. As the middle panel of
Figure 2 shows, in the limit of large $\langle k \rangle$ the naive estimate
$p_{\mathrm{in}}^{c} \approx p_{\mathrm{in}}^{n} = 2 E^{\mathrm{Rnd}} / \langle k \rangle$ provides a good, but
conservative, approximation.

FIG. 2: Left: The critical value of $p_{\mathrm{in}}$ beyond which the cluster structure starts to influence the ground state of the bisection
problem, i.e. below which clusters cannot be detected. We compare Erdős-Rényi graphs (ER) with a Poissonian degree
distribution $p(k) = e^{-\lambda} \lambda^k / k!$ and two types of scale free degree distributions. The first is a stretched power law (SF
$\Delta k$) of the form $p(k) \propto (k + \Delta k)^{-\gamma}$ with $\Delta k \in [2, 50]$, and the second (SF $k_{\min}$) is of the form $p(k) \propto k^{-\gamma}$ with a varying
minimum degree $k_{\min} \in [2, 30]$. For both scale free distributions we choose $\gamma = 3$. Since we are interested only in
the behavior of the giant connected component, we set $p(k = 0) = 0$ in all cases. Middle: The ratio of $p_{\mathrm{in}}^{c}$ and $p_{\mathrm{in}}^{n}$, the
naive estimate for the transition point $p_{\mathrm{in}}^{n} = 2 E^{\mathrm{Rnd}} / \langle k \rangle$, which always overestimates $p_{\mathrm{in}}^{c}$. Right: Achievable accuracy for the
planted partition problem on ER graphs with $N \to \infty$, $\langle k \rangle = 16$ and 4 equal sized clusters and numerical results obtained from
the best graph clustering algorithms on equivalent networks with $N = 128$ nodes [6, 7]. We attribute the observed differences
to finite size effects.

All the results described here analytically for two clusters
can be obtained for more than two clusters by an
efficient population dynamics algorithm which will be
described elsewhere [24]. As an example, the right panel
of Figure 2 shows the maximum theoretically attainable accuracy
for a commonly used benchmark in graph clustering or
community detection. While the actual benchmark uses
networks with 128 nodes in 4 clusters and an average of
16 links per node, we calculate the accuracy for an
infinitely large network with the same number of clusters
and degree distribution. We recover the transition point
and the upper part of the transition from the best available
graph clustering algorithms [6, 7]. Given the fact
that the numerical experiments were obtained on a relatively
small network and our theory applies to the
thermodynamic limit, the agreement between theory and
experiment is remarkably good.

In summary, we have shown that the sparsity of a network
may limit the usefulness of unsupervised clustering
methods. Even though cluster structure is present, it
remains undetectable and hidden behind alternative solutions
to the clustering problem that have zero correlation
with the true solution. If we were to draw an analogy
to unsupervised learning problems on multivariate
data, we could say the average connectivity of a network
plays the role of the ratio $\alpha$ between the number of data
points and the dimensionality of a multivariate data set.
The fundamental difference is that the average connectivity
is not a free parameter in sparse networks and cannot
be increased by adding more nodes to the network.
Adding nodes to the network inevitably increases the
dimensionality of the data. Thus we are dealing with a
qualitatively different phenomenon. Our results may be
valuable for the design of network clustering algorithms
and their benchmarks as well as for a critical assessment
of the amount of information that can be derived from
unsupervised learning or data-mining on networks.

We thank David Saad, Wolfgang Kinzel and Georg 
Reents for stimulating discussions. 